
    Adaptive transaction scheduling for transactional memory systems

    Transactional memory systems are expected to enable parallel programming at lower programming complexity while delivering improved performance over traditional lock-based systems. Nonetheless, there are situations in which transactional memory systems can actually perform worse. Transactional memory outperforms locks only when the executing workload contains sufficient parallelism; when the workload lacks inherent parallelism, launching excessive transactions can severely degrade performance. Such situations will become dominant in future workloads as large-scale transactions are executed more frequently. In this thesis, we propose a new paradigm called adaptive transaction scheduling to address this issue. Based on parallelism feedback from applications, our adaptive transaction scheduler dynamically dispatches and controls the number of concurrently executing transactions. In our case study, we show that this low-cost mechanism not only guarantees that hardware transactional memory systems perform no worse than a single global lock, but also significantly improves performance for both hardware and software transactional memory systems.
    M.S. Committee Chair: Lee, Hsien-Hsin; Committee Member: Blough, Douglas; Committee Member: Yalamanchili, Sudhakar
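
    The feedback loop described above can be sketched in a few lines: admit transactions against a concurrency limit, and shrink or grow that limit from commit/abort outcomes. The C sketch below is only an illustration of the idea, not the thesis's actual mechanism; every identifier (ats_begin, ats_end, MAX_TXN) and threshold in it is an invented assumption.

        /* Minimal sketch of parallelism-feedback transaction throttling.
         * All names and thresholds are illustrative, not from the thesis. */
        #include <stdatomic.h>

        #define MAX_TXN 64

        static atomic_int ats_limit   = MAX_TXN; /* transactions allowed in flight */
        static atomic_int ats_running = 0;
        static atomic_int ats_commits = 0;
        static atomic_int ats_aborts  = 0;

        void ats_begin(void)
        {
            /* Spin until admitted; a real scheduler would queue the thread. */
            for (;;) {
                int n = atomic_load(&ats_running);
                if (n < atomic_load(&ats_limit) &&
                    atomic_compare_exchange_weak(&ats_running, &n, n + 1))
                    return;
            }
        }

        void ats_end(int committed)
        {
            atomic_fetch_sub(&ats_running, 1);
            atomic_fetch_add(committed ? &ats_commits : &ats_aborts, 1);

            /* Feedback: a high abort rate signals low parallelism, so shrink
             * the window; mostly commits let concurrency grow again.
             * Races on the window reset are benign for a sketch. */
            int c = atomic_load(&ats_commits), a = atomic_load(&ats_aborts);
            if (c + a >= 256) {
                int limit = atomic_load(&ats_limit);
                if (a > c)                          /* >50% aborts */
                    atomic_store(&ats_limit, limit > 1 ? limit / 2 : 1);
                else if (limit < MAX_TXN)
                    atomic_store(&ats_limit, limit + 1);
                atomic_store(&ats_commits, 0);
                atomic_store(&ats_aborts, 0);
            }
        }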

    SoftCache Architecture

    Multiple trends in computer architecture are beginning to collide as process technology reaches ever smaller feature sizes. Problems with managing power, access times across a die, and the increasing complexity needed to sustain growth now block commercial products like the Pentium 4. These problems also occur in the embedded-system space, albeit in a slightly different form; and as process technology marches on, today's high-performance space becomes tomorrow's embedded space. New techniques are needed to overcome these problems. In this thesis, we propose a novel architecture called SoftCache to address these emerging issues for embedded systems. We strip down the on-die memory controller infrastructure, reducing both power and space requirements, and use the ubiquitous network-device arena as a proving ground of viability. In addition, the SoftCache achieves further power and area savings by converting on-die cache structures into directly addressable SRAM and by reducing or eliminating the external DRAM. To spare the application developer the programming complexity this approach would otherwise impose, we provide a transparent client-server dynamic binary translation system that runs arbitrary ELF executables on a stripped-down embedded target. One drawback of such a scheme lies in the overhead of the additional instructions required to effect cache behavior, particularly with respect to data caching. Another is the power spent fetching from remote memory over the network. The SoftCache comprises a dynamic client-server translation system on simplified hardware, targeted at Intel XScale client devices controlled from servers over the network. Reliance upon a network server as a "backing store" introduces new levels of complexity, yet also allows for more efficient use of local space. The explicitly software-managed approach creates a cache of variable line size, full associativity, and high flexibility. This thesis explores these issues while approaching everything from the perspective of feasibility and actual architectural changes.
    Ph.D. Committee Co-Chair: Lee, Hsien-Hsin Sean; Committee Co-Chair: Ramachandran, Kishore; Committee Member: Mackenzie, Kenneth; Committee Member: Pande, Santosh; Committee Member: Schimmel, David; Committee Member: Schwan, Karsten
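
    The client-server split described above can be illustrated with a hypothetical miss path: on a miss, the client runtime asks the server (the backing store) for the region it lacks and installs it in the directly addressed SRAM. The sketch below is a guess at the shape of such a handler; the wire format and the helpers net_send, net_recv, and sram_alloc are assumptions, not the SoftCache protocol.

        /* Hypothetical client-side miss handler: request the missing
         * region from the server and install it in on-chip SRAM. */
        #include <stdint.h>

        struct region { uint32_t addr, len; };

        extern int      net_send(const void *buf, uint32_t len); /* assumed */
        extern int      net_recv(void *buf, uint32_t len);       /* assumed */
        extern uint8_t  sram[];                    /* directly addressed SRAM */
        extern uint32_t sram_alloc(uint32_t len);  /* assumed; may evict      */

        void *softcache_miss(uint32_t faulting_addr, uint32_t len)
        {
            struct region req = { faulting_addr, len };
            net_send(&req, sizeof req);        /* ask the backing store */

            uint32_t slot = sram_alloc(len);   /* make room locally     */
            net_recv(&sram[slot], len);        /* install working set   */
            return &sram[slot];
        }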

    Chameleon: Virtualizing Idle Acceleration Cores of A Heterogeneous Multi-Core Processor for Caching and Prefetching

    Heterogeneous multi-core processors have emerged as an energy- and area-efficient architectural solution to improving performance for domain-specific applications, such as those with a plethora of data-level parallelism. These processors typically contain a large number of small, compute-centric cores for acceleration while keeping one or two high-performance ILP cores on the die to guarantee single-thread performance. Although a major portion of the transistors is occupied by the acceleration cores, these resources sit idle when running unparallelized legacy code or the sequential parts of an application. To address this under-utilization, in this paper we introduce Chameleon, a flexible heterogeneous multi-core architecture that virtualizes these resources to enhance memory performance when running sequential programs. The Chameleon architecture can dynamically virtualize the idle acceleration cores into a last-level cache, a data prefetcher, or a hybrid of the two techniques. In addition, Chameleon can operate in an adaptive mode that dynamically configures the acceleration cores between the hybrid mode and the prefetch-only mode by monitoring the effectiveness of the Chameleon caching scheme. In our evaluation using the SPEC2006 benchmark suite, different modes achieved different levels of performance improvement across applications. In the adaptive mode, Chameleon improves the performance of SPECint06 and SPECfp06 by 33% and 22% on average. Considering only memory-intensive applications, Chameleon improves system performance by 53% and 33%, respectively.
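
    The adaptive mode amounts to a periodic decision: keep the hybrid configuration while the virtualized cache shows enough reuse, and otherwise fall back to prefetch-only. Below is a minimal sketch of such a controller; the sampling window and threshold are illustrative assumptions, not figures from the paper.

        /* Sketch of the adaptive-mode decision: sample the virtualized
         * cache's effectiveness and fall back to prefetch-only when
         * caching is not paying off. Names and constants are invented. */
        enum mode { HYBRID, PREFETCH_ONLY };

        enum mode chameleon_adapt(unsigned hits, unsigned misses, enum mode cur)
        {
            if (hits + misses < 1024)      /* too few samples to judge */
                return cur;
            unsigned hit_pct = 100u * hits / (hits + misses);
            /* Caching in the acceleration cores' storage is only worth
             * it if it captures reuse; otherwise spend the idle cores
             * on prefetching alone. */
            return (hit_pct >= 10) ? HYBRID : PREFETCH_ONLY;
        }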

    Investigating a SoftCache via Dynamic Rewriting

    Software caching via binary rewriting gives networked embedded devices the benefits of a memory hierarchy without the hardware costs. A software cache replaces the hardware cache/MMU mechanisms of the embedded system with software management of on-chip RAM, using a network server as the backing store. The bulk of the software complexity is placed on the server, so the embedded system holds only the application's current working set and a small runtime system invoked on cache misses. We present a design and implementation of instruction caching using an ARM-based embedded system and a separate server, and detail the issues discovered. We show that the software cache succeeds at discovering the small working set of several test applications, reducing resident code by 7 to 14X relative to the original application. Further, we show that our software overhead remains small for typical functions in embedded applications. Finally, we discuss the implications of software caching for dynamic optimization, for power savings, and for hardware design.
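
    The division of labor described above leaves the client with little more than a lookup and a miss path. A minimal sketch of that runtime follows; all helper names (tcache_find, fetch_block_from_server) are invented for illustration, not taken from the paper.

        /* Illustrative shape of the small client runtime: control
         * transfers go through a lookup; a miss pulls the target block
         * from the server and caches it locally, so only the current
         * working set ever resides on-chip. */
        #include <stdint.h>

        extern void *tcache_find(uint32_t pc);               /* assumed lookup */
        extern void  tcache_insert(uint32_t pc, void *code); /* assumed insert */
        extern void *fetch_block_from_server(uint32_t pc);   /* assumed RPC    */

        void *runtime_dispatch(uint32_t target_pc)
        {
            void *code = tcache_find(target_pc);
            if (!code) {                      /* miss: grow the working set */
                code = fetch_block_from_server(target_pc);
                tcache_insert(target_pc, code);
            }
            return code;                      /* caller branches here       */
        }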

    SoftCache: Dynamic Optimizations for Power and Area Reduction in Embedded Systems (II)

    We propose a SoftCache for low power and reduced die area while providing application flexibility. Our implementations demonstrate that the network is a power-efficient means of accessing remote memory. The impact of this work suggests that SoftCache systems may be useful in future consumer electronics. Our results show that die power is reduced by 20%, die area is reduced by 10%, and transferring applications over the network is more energy-delay effective than using local DRAM.

    Software Caching using Dynamic Binary Rewriting for Embedded Devices

    A software cache implements instruction and data caching entirely in software. Dynamic binary rewriting offers a means to specialize the software cache miss checks at cache-miss time. We describe a software cache system implemented using dynamic binary rewriting and observe that the combination is particularly appropriate for a simple embedded system connected to a more powerful server over a network; two examples are a network of sensors with local processing, and cell phones connected to cell towers. We describe two software cache systems that implement instruction caching only, using dynamic binary rewriting, and present results for the performance of instruction caching in these systems. We measure time overheads of 19% compared to no caching. We also show that we can guarantee a 100% hit rate for codes that fit in the cache. For comparison, we estimate that a comparable hardware cache would incur a space overhead of 12-18% for its tag array and would offer no hit-rate guarantee.
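
    Specializing a miss check at miss time can be pictured as patching the call site itself: the first transfer traps to the miss handler, which overwrites the trapping instruction with a direct branch to the now-resident target, so resident-to-resident transfers never check again; this is also why code that fits entirely in the cache can see a 100% hit rate. The ARM-flavored sketch below simplifies the encoding details, and its helper names are assumptions rather than the paper's implementation.

        /* Sketch of specializing a miss check at miss time: replace the
         * 'bl miss_handler' at the call site with a direct PC-relative
         * branch to the resident target, eliminating future checks. */
        #include <stdint.h>

        extern void *resolve_target(uint32_t pc);       /* loads block if needed */
        extern void  flush_icache(void *addr, int len); /* assumed helper        */

        void patch_call_site(uint32_t *call_site, uint32_t target_pc)
        {
            void *dst = resolve_target(target_pc);
            /* ARM B <imm24>: offset counted in words from PC+8. */
            int32_t off = ((int32_t)((uintptr_t)dst - (uintptr_t)call_site) - 8) >> 2;
            *call_site = 0xEA000000u | ((uint32_t)off & 0x00FFFFFFu);
            flush_icache(call_site, 4);   /* make the patch visible to fetch */
        }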

    The elusive metric for low-power architecture research

    The Energy-Delay (ED) product was proposed as a metric to gauge design effectiveness. The metric is widely used in low-power architecture research; however, it is also often improperly used when reporting a new architecture design that claims energy-performance effectiveness. In this paper, we discuss two common fallacies from the literature: (1) the way the ED product is calculated, and (2) issues with introducing additional hardware structures to reduce dynamic switching activity. When the ED product is used without meticulous consideration, a seemingly energy-efficient design can turn out to be a more energy-consuming one.
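
    To make the first fallacy concrete, compare the metric with its performance-weighted sibling on illustrative numbers (these are assumptions for exposition, not figures from the paper):

        \[
          ED = E \times D, \qquad ED^2 = E \times D^2 .
        \]
        % Baseline: E = 1, D = 1, so ED = 1 and ED^2 = 1.
        % A "low-power" design with E = 0.5 and D = 2 also gives ED = 1,
        % so it looks no worse under ED, yet ED^2 = 2: the same design is
        % twice as bad once performance is weighted more heavily. Choosing
        % the metric (and the baseline) carelessly can flip the conclusion.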

    POD: A Parallel-On-Die Architecture

    As power constraints, complexity, and design-verification cost make it difficult to improve single-stream performance, the parallel computing paradigm is taking its place among mainstream high-volume architectures. Most current commercial designs focus on MIMD-style CMPs built from rather complex single cores. While such designs provide a degree of generality, they may not be the most efficient way to build processors for applications with inherently scalable parallelism. These designs have been proven to work well for certain classes of applications, such as transaction processing, but they have driven the development of new languages and complex architectural features. Instead of building MIMD CMPs for all workloads, we propose an alternative parallel on-die many-core architecture called POD, based on a large SIMD PE array. POD addresses the key challenges of on-chip communication bandwidth, area limitations, and router energy consumption by factoring out the features necessary only for MIMD machines and focusing on architectures that match many scalable workloads. In this paper, we evaluate and quantify the advantages of the POD architecture, basing its ISA on a commercially relevant CISC architecture, and show that it can be as efficient as more specialized array processors built on one-off ISAs. Our single-chip POD is capable of best-in-class scalar performance and up to 1.5 TFLOPS of single-precision floating-point arithmetic. Our experimental results show that in some application domains our architecture achieves nearly linear speedup on a large number of SIMD PEs, far exceeding the maximum speedup that MIMD CMPs of the same die size can achieve. Furthermore, owing to its synchronized computation and communication, POD efficiently suppresses the energy consumed by the novel communication method in its interconnection network.
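
    The area argument behind POD can be made concrete with a back-of-envelope model: a SIMD array amortizes fetch/decode logic and routers across PEs, so a fixed die budget fits many more SIMD lanes than MIMD cores, and Amdahl's law then bounds what each organization can achieve. All constants in the sketch below are illustrative assumptions, not figures from the paper.

        /* Toy area/speedup model: same die budget, different per-unit
         * cost for a full MIMD core versus a control-sharing SIMD PE. */
        #include <stdio.h>

        static double amdahl(double parallel_frac, double n)
        {
            return 1.0 / ((1.0 - parallel_frac) + parallel_frac / n);
        }

        int main(void)
        {
            double die       = 100.0; /* arbitrary area units              */
            double mimd_core = 10.0;  /* full core: ILP logic plus router  */
            double simd_pe   = 1.0;   /* PE: datapath only, shared control */
            double f         = 0.99;  /* parallel fraction of the workload */

            double mimd_n = die / mimd_core;  /* 10 cores on this budget */
            double simd_n = die / simd_pe;    /* 100 PEs on this budget  */

            printf("MIMD: %.0f cores -> %.1fx speedup\n", mimd_n, amdahl(f, mimd_n));
            printf("SIMD: %.0f PEs   -> %.1fx speedup\n", simd_n, amdahl(f, simd_n));
            return 0;
        }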